backend: add prometheus metric for large snapshot duration. #7892
c01e297 to 94054a4
I manually created 2 snapshots that took 10 seconds each and saw the following metrics:
I am unsure why buckets 32-512 also have a count of 2. EDIT:
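A likely explanation, which the captured thread does not spell out: Prometheus histogram buckets are cumulative, so each `le` bucket counts every observation less than or equal to its upper bound. The sketch below reproduces that behavior with the standard library only. `exponentialBuckets` and `cumulativeCounts` are hypothetical helpers written for illustration, not part of the Prometheus client; the first one is assumed to mirror how `prometheus.ExponentialBuckets(1, 2, 10)` lays out its bounds.

```go
package main

import "fmt"

// exponentialBuckets is a hypothetical stand-in that mirrors the documented
// behavior of prometheus.ExponentialBuckets: it returns count upper bounds,
// starting at start, each one factor times the previous.
func exponentialBuckets(start, factor float64, count int) []float64 {
	bounds := make([]float64, count)
	for i := range bounds {
		bounds[i] = start
		start *= factor
	}
	return bounds
}

// cumulativeCounts returns, for each upper bound, how many observations are
// less than or equal to it. This is how Prometheus reports "le" buckets.
func cumulativeCounts(bounds, observations []float64) []int {
	counts := make([]int, len(bounds))
	for i, le := range bounds {
		for _, v := range observations {
			if v <= le {
				counts[i]++
			}
		}
	}
	return counts
}

func main() {
	bounds := exponentialBuckets(1, 2, 10) // 1, 2, 4, ..., 512
	// Two snapshots of 10 seconds each, as in the manual test above.
	counts := cumulativeCounts(bounds, []float64{10, 10})
	for i, le := range bounds {
		fmt.Printf("le=%g count=%d\n", le, counts[i])
	}
}
```

Under these assumptions, every bucket whose upper bound is at least 10 seconds (16 through 512) reports both observations, while the buckets below 10 seconds stay at zero.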
94054a4 to 6b5272e
mvcc/backend/metrics.go
Outdated
@@ -24,8 +24,18 @@ var (
Help: "The latency distributions of commit called by backend.",
Buckets: prometheus.ExponentialBuckets(0.001, 2, 14),
})

snapShotDurations = prometheus.NewHistogram(prometheus.HistogramOpts{
snapshotDurations
mvcc/backend/metrics.go
Outdated
Name: "backend_snapshot_duration_seconds",
Help: "The latency distributions of Snapshot called by backend.",
// 1 second -> 1024 seconds
Buckets: prometheus.ExponentialBuckets(1, 2, 10),
1024 seconds is extreme, probably want to capture something like [10ms -- 30s]
mvcc/backend/metrics.go
Outdated
Namespace: "etcd",
Subsystem: "disk",
Name: "backend_snapshot_duration_seconds",
Help: "The latency distributions of Snapshot called by backend.",
The latency distribution of backend snapshots.
mvcc/backend/metrics.go
Outdated
Name: "backend_snapshot_duration_seconds",
Help: "The latency distributions of Snapshot called by backend.",
// 1 second -> 512 seconds
Buckets: prometheus.ExponentialBuckets(1, 2, 10),
probably want something like 10ms -- 1 minute. 512 seconds is a lot
@xiang90 suggested tracking large snapshot durations, from 1 second to around 10 min.
ok...
we can track low numbers, which are the common case. when a cluster starts to suffer we probably want to track large numbers too. if the snapshot is 8 GB on a slow network, hundreds of seconds is possible
okay, should we do 10ms to 10mins?
edit: 10ms
6b5272e to 42d2cc8
42d2cc8 to 230106d
How about 10ms to 10 mins?
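To make the proposed range concrete, here is a rough sketch of the arithmetic: how many factor-2 buckets are needed, starting at 10ms, before the top bound reaches 10 minutes. `bucketsToCover` is a hypothetical helper written for this illustration. It assumes the same layout as `prometheus.ExponentialBuckets(start, factor, count)`, where the first bound is `start` and each later bound is `factor` times the previous one. It is not necessarily the configuration the PR ended up with.

```go
package main

import "fmt"

// bucketsToCover returns how many exponential buckets are needed so that the
// largest upper bound is at least target, together with that largest bound.
// The first bucket's upper bound is start; each later bound is multiplied by
// factor.
func bucketsToCover(start, factor, target float64) (int, float64) {
	count, bound := 1, start
	for bound < target {
		bound *= factor
		count++
	}
	return count, bound
}

func main() {
	// 0.01 s = 10ms lower bound, 600 s = 10 minutes target, doubling each step.
	count, top := bucketsToCover(0.01, 2, 600)
	fmt.Printf("count=%d top=%.2fs\n", count, top)
}
```

With a factor of 2, the bounds cannot land exactly on 600 seconds, so the top bucket overshoots slightly. A smaller factor would give finer resolution at the cost of more buckets per histogram.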
@gyuho I didn't find this one in CHANGELOG-3.2.md. It was merged in 3.2, right? Just want to confirm; I can add it to the changelog and backport to 3.1 if it's safe.
@wenjiaswe Yeah, it's in 3.2, not in 3.1. Let's backport and update the changelog.
CHANGELOG-3.2: update from #7892
FIXES #7878